ABSTRACT

Many social Web sites allow users to annotate the content 
with descriptive metadata, such as tags, and more recently 
to organize content hierarchically. These types of structured 
metadata provide valuable evidence for learning how a community
organizes knowledge. For instance, we can aggregate many personal
hierarchies into a common taxonomy, also known as a folksonomy,
that will aid users in visualizing and browsing social content,
and also help them organize their own content. However, learning from social
metadata presents several challenges, since it is sparse, shal- 
low, ambiguous, noisy, and inconsistent. We describe an ap- 
proach to folksonomy learning based on relational clustering, 
which exploits structured metadata contained in personal 
hierarchies. Our approach clusters similar hierarchies using 
their structure and tag statistics, then incrementally weaves 
them into a deeper, bushier tree. We study folksonomy 
learning using social metadata extracted from the photo-sharing
site Flickr, and demonstrate that the proposed approach addresses
the challenges. Moreover, compared to previous work, the approach
produces larger, more accurate folksonomies and scales better.

Categories and Subject Descriptors 

H.2.8 [DATABASE MANAGEMENT]: Database Applications - Data mining;
I.2.6 [ARTIFICIAL INTELLIGENCE]: Learning - Knowledge Acquisition

General Terms 

1. INTRODUCTION

The social Web has changed the way people create and 
use information. Sites like Flickr, Del.icio.us, YouTube, and 
others, allow users to publish and organize content by anno- 
tating it with descriptive keywords, or tags. Some web sites 
also enable users to organize content hierarchically. The 
photo-sharing site Flickr, for example, allows users to group 
related photos in sets, and related sets in collections. Al- 
though these types of social metadata lack formal structure, 
they capture the collective knowledge of Social Web users. 
Once mined from the traces left by many users, such collec- 
tive knowledge will add a rich semantic layer to the content 
of the Social Web that will potentially support many tasks in 
information discovery such as personalization, data mining, 
and information management. 

A community's knowledge can be expressed through a 
common taxonomy, also called a folksonomy, that is learned 
from social metadata created by many users. Compared to
existing hierarchies, such as the Linnaean classification system
or WordNet, automatically learned folksonomies are attractive
because they (1) represent the collective agreement of many
individuals; (2) are relatively inexpensive to obtain; (3) can
adapt to evolving vocabularies and a community's information
needs; and (4) are directly tied to the annotated content.
A folksonomy can facilitate browsing of user-generated content,
help users visualize how their own content fits within the
community's, and aid them in organizing it.

Learning a folksonomy by integrating structured meta- 
data created by many users presents a number of challenges. 
Since users are free to annotate data according to their own
preferences, social metadata is noisy, shallow, sparse, ambiguous,
conflicting, multi-faceted, and expressed at inconsistent
granularity levels across many users. Several recent works have
addressed some of these challenges. For instance, [3, 15] proposed
inducing folksonomies from tags by utilizing tag statistics. The
basic motivation behind these approaches is that more frequent
tags describe more general concepts. However, frequency-based
methods cannot distinguish between more general and more popular
concepts. In our previous work, SIG [12], we overcame this problem
by using user-specified relations extracted from personal
hierarchies. Nevertheless, SIG ignored other evidence, e.g., the
structure of hierarchies and tags, which can potentially address
the challenges listed above.

We propose a novel approach to learning folksonomies from
social metadata in the form of tags and user-specified shallow
hierarchies. Our approach is driven by a similarity measure
that utilizes statistics of both kinds of metadata to
incrementally weave individual hierarchies into a deeper, more
complete folksonomy. The approach has several advantages
over previous work. Specifically, it (1) better addresses the
challenges of sparse, shallow, ambiguous, noisy, and inconsistent
data; (2) is more scalable, especially when the learned
folksonomies are deep; and (3) produces more consistent and
richer folksonomies. We demonstrate the utility of our approach
on real-world data from Flickr, and introduce a simple metric
that evaluates the quality of the learned folksonomies in terms
of depth and bushiness.

2. STRUCTURED SOCIAL METADATA 

In addition to tagging content, some social Web sites also 
allow users to organize it hierarchically. Delicious users can 
group related tags into bundles, and Flickr users can group 
related photos into sets and then group related sets into
collections. While the sites themselves do not impose any
constraints on the vocabulary or semantics of the hierarchies,
in practice users employ them to represent both subclass
relationships ('dog' is a kind of 'mammal') and part-of
relationships ('my kids' is a part of 'family'). Users appear to
express both types of relations (and possibly others) through 
personal hierarchies, in effect using the hierarchies to spec- 
ify broader/narrower relations. Even without strict seman- 
tics being attached to these relations, we believe that per- 
sonal hierarchies represent a novel, rich source of evidence 
for learning folksonomies. 

Flickr allows users to group their photos in album-like folders,
called sets. Users can also group sets into "super" albums,
called collections.^ Both sets and collections are
named by the owner of the image. A photo can be part
of multiple sets.

While Flickr does not enforce any specific rules about how 
to organize photos or how to name them, most users group 
"similar" or "related" photos into the same set and related 
sets into the same collection. Some users create multi-level
hierarchies containing collections of collections, etc., but the
vast majority of users create shallow hierarchies, consisting
of collections and their constituent sets. Figure 1(a)
shows some of the collections created by an avid naturalist
on Flickr. These collections reflect the subjects she likes to
photograph: Birds, Mammals, Plants, Mushrooms & Fungi,
Plant Pests, Plant Diseases, etc. Figure 1(b) shows the sets of
the Plant Pests collection: Plant Parasites, Sap Suckers,
Plant Eaters, and Caterpillars. Each set contains one or more
photos, which are tagged by the user. For example, a photograph
in the set Caterpillars (Figure 1(c)) is annotated
with multiple tags describing its subject (Animal, Lepidoptera,
Moth, larva, Caterpillar), its color (Black and orange),
condition (on Senecio, eating), and location (North Seatac
Park, King County, WA, North America).

3. CHALLENGES IN LEARNING FROM 
STRUCTURED METADATA 

Learning folksonomies from social metadata, specifically, 
from structured metadata, presents a number of challenges: 

3.1 Sparseness 

Social metadata is usually very sparse. Users provide 4-7
tags per bookmark on Delicious in our data set and 3.74
tags per photo on Flickr [13]. Sparseness is also manifested
in the hierarchical organization created by an individual. In
our Flickr data set, we found only 600 out of 21,792 users
(approximately 2.8 percent) who created multi-level
(collections of collections) hierarchies. Most users define shallow
(single-level) hierarchies. Moreover, among these shallow
hierarchies, few users organize content the same way. For
instance, of the 433 users who created an animal collection,
only a few created common child sets, such as bird, cat, dog
or insect. In order to learn a rich and complete folksonomy,
we have to aggregate social metadata from many different
users.

3.2 Noisy vocabulary 

Vocabulary noise has several sources. One common source 
is variations and errors in spelling. Noise also arises from 
users' idiosyncratic naming conventions. While such names
as not sure, pleaseaddthistothethemecomppoll, and mykid may
be meaningful to the image owner and her narrow interest group,
they are relatively meaningless to other users.



Figure 1: Personal hierarchies specified by a Flickr
user: (a) some of the collections created by the user,
(b) sets associated with the Plant Pests collection,
and (c) tags associated with an image in the
Caterpillars set.

We briefly describe how this feature is implemented on the 
social photo-sharing site, Flickr (http://www.flickr.com). 



3.3 Ambiguity 

An individual tag is often ambiguous [5] [3]. For exam- 
ple, jaguar can be used to refer to a mammal or a luxury 
car. Similarly, terms that are used to name collections and 



^The collection feature is limited to paid "pro" users. Pro
users can also create an unlimited number of photo sets, while
free membership limits a user to three sets.

Figure 2: Schematic diagrams of personal hierarchies
created by Flickr users. (a) Ambiguity: the
same term may have different meanings ("turkey" can
refer to a bird or a country). (b) Conflict: users'
different organization schemes can be incompatible
(china is a parent of travel in one hierarchy, but
the other way around in another). (c) Granularity:
users have different levels of expressiveness and
specificity, and even mix different specificity levels
within the same hierarchy (Scotland (country) and
London (city) are both children of UK). Nodes are
colored to aid visualization.



sets can refer to different concepts. Consider the hierarchy
in Figure 2(a), where the turkey collection could be about
a bird or a country. Similarly, victoria can be a place in
either Canada or Australia. When combining metadata
to learn common folksonomies, we need to be aware of its
meaning. Structural and contextual information may help
disambiguate metadata.

3.4 Structural noise and conflicts 

Like vocabulary noise, structural noise has a number of 
sources and can lead to inconsistent or conflicting structures. 
Structural noise can arise as a result of variations in indi- 
viduals' organization preferences. Suppose that, as shown 
in Figure 2(b), user A organizes photos first by activity,
creating a collection called travel, and as part of this collec- 
tion, a set called china, for photos of her travel in China. 
Meanwhile, user B organizes photos by location first, cre- 
ating a collection china, with constituent sets travel, people, 
food, etc. In one hierarchy, therefore, travel is more gen- 
eral than china, and in the second hierarchy, it is the other 
way around. Sometimes conflicts are caused by vocabulary 
differences among individual users. For example, to some 
users bug is a "pest," a term broader than insect, while to 
others it is a subclass of insect. As a result, some users may 
express bug → insect, while others express the inverse
relation. Another source of noise is variation in degree of 
expertise on a topic. Many users assemble images of spiders 
in a set called spiders and assign it to an insect collection, 
while others correctly assign spiders to arachnid. 

3.5 Varying granularity level 

Differences in users' level of expertise and expressiveness
may also lead to relatively imprecise metadata. Experts
may use specific breed names to tag dog photos, while
non-experts will simply use the tag dog to annotate them [5]. In
addition, one user may organize photos first by country and
then by city, while another organizes them by country, then
subregion, and then city, as shown in Figure 2(c). Combining
data from these users potentially generates multiple paths
from one concept to another.

4. LEARNING FOLKSONOMIES FROM 
STRUCTURED METADATA 

We propose a simple, yet effective approach to combine
many personal hierarchies into a global folksonomy that
takes the above challenges into account. We define a personal
hierarchy as a shallow tree, a sapling, composed of a root node
$r^i$ and its children, or leaf nodes $\langle l^i_1, \ldots, l^i_j \rangle$.
The root node corresponds to a user's collection and inherits its
name, while the leaf nodes correspond to the collection's
constituent sets and inherit their names. Only a small number of
users define multi-level hierarchies; we decompose these and
represent them as collections of saplings. At the top level,
we have a root node, which corresponds to the top-level
collection, and its leaf nodes, which correspond to the root's sets
or collections. We then construct saplings that correspond
to the leaf nodes that are collections, and so on. We assume
that hierarchical relations between a root and its children,
$r^i \rightarrow l^i_j$, specify broader-narrower relations. Hence,
the sapling in Figure 1(b) is Plant Pests → {Plant Parasites,
Sap Suckers, Plant Eaters, Caterpillars}.

In addition to hierarchical structure, each sapling carries
information derived from tags. On Flickr, users attach tags
only to photos; therefore, the tag statistics of a sapling's
leaf (set) are aggregated from that set's constituent photos.
Tag statistics are then propagated from the leaves to the
parent node. In our example, Plant Parasites aggregates tag
statistics from all photos in this set, and its parent Plant
Pests contains tag statistics accumulated from all photos in
Plant Parasites and its siblings. We define the tag statistic
of node x as $\tau_x := \{(t_1, f_{t_1}), (t_2, f_{t_2}), \ldots, (t_k, f_{t_k})\}$,
where $t_k$ and $f_{t_k}$ are a tag and its frequency, respectively.
Hence, $\tau_{r^i}$ is aggregated from all $\tau_{l^i_j}$s.
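
The aggregation of tag statistics from photos to sets and then to the collection can be sketched as follows. This is a minimal illustration, assuming each photo is given as a list of distinct tags; the function names and the sample data are hypothetical and not part of the original system.

```python
from collections import Counter


def leaf_tag_stats(photos):
    """Aggregate tag frequencies over the photos of one set (a sapling leaf)."""
    stats = Counter()
    for tags in photos:
        stats.update(tags)  # each photo contributes one count per tag
    return stats


def root_tag_stats(leaf_stats):
    """Propagate the leaf statistics up to the collection (the sapling root)."""
    total = Counter()
    for stats in leaf_stats.values():
        total.update(stats)
    return total


# Hypothetical sapling: a collection with two sets of two photos each.
leaves = {
    "plant parasites": leaf_tag_stats([["fungus", "plant"], ["plant", "gall"]]),
    "caterpillars": leaf_tag_stats([["larva", "plant"], ["larva", "moth"]]),
}
root = root_tag_stats(leaves)
```

Note that under this sketch a photo that belongs to several sets of the same collection is counted once per set at the root.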

Given a collection of saplings, specified by many different
users, our goal is to aggregate them into a common, denser
and deeper tree. Before describing our approach, we first
briefly describe data preprocessing steps that address the
sparseness and noise challenges listed above.

4.1 Data Preprocessing 

We extract terms representing concepts from collection
and set names. We found that users often combine two or
more concepts within a single name, e.g., "Dragonflies/Damselflies",
"Mushrooms & Fungi", "Moth at Night." Terms can
be joined by bridging words, which include the prepositions "at",
"of", "in," and the conjunctions "and" and "or," or by special
characters, such as '&', '<', '>', ':', '/'. We start by tokenizing
collection and set names on these words and characters. We
do not tokenize on white spaces to avoid breaking up terms
like "South Africa." We remove terms composed only of
non-alphanumeric characters and frequently used uninformative
words, e.g., "me" and "myself." We then normalize all terms
by lowercasing them.

After tokenization, a set or collection name may be split
into multiple terms, which we expand into leaves. Suppose
a user created a collection animal containing a set cats and
dogs. After tokenization we get the sapling animal →
{cats, dogs}. However, if the root node is determined to
have a composite name, we ignore the entire sapling because
we do not know which parent concepts correspond to which
child concepts.
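
The preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the regular expression covers only the bridging words and special characters listed in the text, the stop list contains only the two example words, and the function names are hypothetical.

```python
import re

# Bridging words must be surrounded by white space, so that names such as
# "South Africa" or "Portland" are not broken up.
BRIDGE = re.compile(r"\s+(?:at|of|in|and|or)\s+|[&<>:/]", re.IGNORECASE)
STOP_TERMS = {"me", "myself"}  # illustrative stop list


def extract_terms(name):
    """Split a set or collection name into normalized concept terms."""
    terms = []
    for part in BRIDGE.split(name):
        term = part.strip().lower()
        # Drop empty strings, pure punctuation, and uninformative words.
        if not term or not any(ch.isalnum() for ch in term) or term in STOP_TERMS:
            continue
        terms.append(term)
    return terms


def build_sapling(collection_name, set_names):
    """Return (root, leaves), or None when the root name is composite."""
    root_terms = extract_terms(collection_name)
    if len(root_terms) != 1:
        return None
    leaves = []
    for set_name in set_names:
        for term in extract_terms(set_name):
            if term not in leaves:
                leaves.append(term)
    return root_terms[0], leaves
```

For example, `build_sapling("animal", ["cats and dogs"])` yields the sapling animal → {cats, dogs}, while a collection named "Mushrooms & Fungi" is skipped because its root is composite.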

4.2 Relational Clustering of Structured Meta- 
data 

In order to learn a folksonomy, we need to aggregate saplings 
both horizontally and vertically. By horizontal aggregation, 
we mean merging saplings with similar roots, which expands 
the breadth of the learned tree by adding leaves to the root. 
By vertical aggregation, we mean merging one sapling's leaf 
to the root of another, extending the depth of the learned 
tree. The approach we use exploits contextual information 
from neighbors in addition to local features to determine 
which saplings to merge. The approach is similar to relational
clustering [1], and its basic element is the similarity
measure between a pair of nodes.

We define a similarity measure that combines the heterogeneous
evidence available in structured social metadata; it is a
combination of local similarity and structural similarity.
The local similarity between nodes a and b, localSim(a, b),
is based on the intrinsic features of a and b, such as their
names and tag distributions. The structural similarity,
structSim(a, b), is based on features of neighboring nodes.
If a is the root of a sapling, its neighboring nodes
are all of its children. If a is a leaf node, the neighboring
nodes are its parent and siblings. The similarity between
nodes a and b is:

$$\mathrm{nodesim}(a, b) = (1 - \alpha) \times \mathrm{localSim}(a, b) + \alpha \times \mathrm{structSim}(a, b), \tag{1}$$

where $0 \le \alpha \le 1$ is a weight for adjusting the contributions
from localSim(,) and structSim(,). We judge two nodes to be
similar if their similarity is greater than the threshold, $\tau$.

4.2.1 Local Similarity 

The local similarity of nodes a and b is composed of (1)
name similarity and (2) tag distribution similarity. Name
similarity can be any string similarity metric that returns
a value ranging from 0 to 1. Tag similarity, tagSim(,), can
be any function for measuring the similarity of distributions.
Because of the sparseness of the data, and to make the
computation fast, we use a simple function that counts the
number of common tags, n, in the top K tags of a and b;
it returns 1 if this number is equal to or greater than J;
otherwise, it returns $n/J$. Local similarity is a weighted
combination of name and tag similarities:

$$\mathrm{localSim}(a, b) = \beta \times \mathrm{nameSim}(a, b) + (1 - \beta) \times \mathrm{tagSim}(a, b). \tag{2}$$

Tag similarity helps address the ambiguity challenge described
in Section 3.3. For example, the top tags of the node
turkey that refers to a bird include "bird", "beak", "feed", 
while the top tags of turkey that refers to the country in- 
clude different terms about places within the country. 
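
The local similarity and its combination with structural similarity can be sketched as follows. The values of K, J, α, and β used here are placeholders, the choice of `difflib.SequenceMatcher` as the name similarity is one possible metric among many, and the tag counts are invented for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def tag_sim(tags_a, tags_b, top_k=40, j=4):
    """Count common tags n among the top-K tags; return 1 if n >= J, else n / J."""
    top_a = {tag for tag, _ in tags_a.most_common(top_k)}
    top_b = {tag for tag, _ in tags_b.most_common(top_k)}
    n = len(top_a & top_b)
    return 1.0 if n >= j else n / j


def name_sim(name_a, name_b):
    """A string similarity in [0, 1]; any such metric could be substituted."""
    return SequenceMatcher(None, name_a, name_b).ratio()


def local_sim(name_a, tags_a, name_b, tags_b, beta=0.5, top_k=40, j=4):
    """Eq. 2: weighted combination of name and tag similarity."""
    return (beta * name_sim(name_a, name_b)
            + (1 - beta) * tag_sim(tags_a, tags_b, top_k, j))


def node_sim(local, struct, alpha=0.5):
    """Eq. 1: weighted combination of local and structural similarity."""
    return (1 - alpha) * local + alpha * struct


# Hypothetical tag statistics for three nodes that are all named "turkey".
bird_a = Counter({"bird": 5, "beak": 3, "feed": 2})
bird_b = Counter({"bird": 2, "beak": 1, "farm": 1})
country = Counter({"istanbul": 4, "ankara": 2, "mosque": 1})

same_sense = local_sim("turkey", bird_a, "turkey", bird_b, j=2)
diff_sense = local_sim("turkey", bird_a, "turkey", country, j=2)
```

Although all three names are identical, only the two bird nodes share enough top tags to reach the maximum local similarity, which is how the tag term separates the senses.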

4.2.2 Structural Similarity 

Structural similarity between two nodes depends on the
position of the nodes within their saplings. We define two
versions: structSimRR(,), which computes structural similarity
between two root nodes (root-to-root similarity), and
structSimLR(,), which evaluates structural similarity between
a root of one sapling and the leaf of another (leaf-to-root
similarity).

Root-to-Root similarity. Two saplings A and B are likely
to describe the same concept if their root nodes $r^A$ and $r^B$
have a similar name and some of their leaf nodes also have
similar names. In this case, there is no need to compute
tagSim(,) of these leaf nodes. We define the normalized
common leaves factor, CL, as $\frac{1}{Z} \sum_{i,j} \delta(name(l^A_i), name(l^B_j))$,
where $\delta(\cdot, \cdot)$ returns 1 if both arguments are exactly the
same and 0 otherwise; $name(l^A_i)$ is a function that
returns the name of a leaf node $l^A_i$ of sapling A. Z is a
normalizing constant, which is described in greater detail later.
Structural similarity between two root nodes is then defined
as follows:



$$\mathrm{structSimRR}(r^A, r^B) = CL + (1 - CL) \times \mathrm{tagSim}(\tau^A_{\neg CL}, \tau^B_{\neg CL}), \tag{3}$$

where $\tau^A_{\neg CL}$ is an aggregation of the tag distributions of all $l^A_i$
for which $name(l^A_i) \neq name(l^B_j)$ for any leaf node $l^B_j$ of
sapling B (and analogously for $\tau^B_{\neg CL}$). From Eq. 3, we compute
similarity based on (1) how many of their children have a common
name (they match); and (2) the tag distribution similarity of those
that do not have the same name. The second term is an optimistic
estimate that child nodes of the two saplings refer to the
same concept while having different names.

The normalization coefficient is $Z = \min(|l^A|, |l^B|)$, where
$|l^X|$ is the number of child nodes of sapling X. We use min(,)
instead of union. The reason is that saplings aggregated from many
small saplings will contain a large number of child nodes.
When merging with a relatively small sapling, the fraction
of common nodes may be very low compared to the total number
of child nodes. Hence, the normalization coefficient with the
union ($Z = \mathrm{union}(l^A, l^B)$), as defined in Jaccard similarity,
results in overly penalizing small saplings. min(,), on the
other hand, seems to correctly consider the proportion of
children of the smaller sapling that overlap with the larger
sapling.
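
The root-to-root structural similarity and the effect of the min(,) normalization can be sketched as follows. The sketch assumes that leaf names are unique within a sapling, so that the double sum in CL reduces to the number of shared names; the `tag_sim` helper repeats the simple top-K overlap function with placeholder values of K and J, and all sample data are invented.

```python
from collections import Counter


def tag_sim(tags_a, tags_b, top_k=40, j=4):
    """Top-K tag overlap: 1 if at least J tags are shared, else n / J."""
    top_a = {tag for tag, _ in tags_a.most_common(top_k)}
    top_b = {tag for tag, _ in tags_b.most_common(top_k)}
    n = len(top_a & top_b)
    return 1.0 if n >= j else n / j


def struct_sim_rr(leaves_a, leaves_b, top_k=40, j=4):
    """Eq. 3, with leaves given as a dict: leaf name -> Counter of tags."""
    common = set(leaves_a) & set(leaves_b)
    z = min(len(leaves_a), len(leaves_b))
    if z == 0:
        return 0.0
    cl = len(common) / z
    # Aggregate the tags of the leaves that have no name match.
    rest_a, rest_b = Counter(), Counter()
    for name, tags in leaves_a.items():
        if name not in common:
            rest_a.update(tags)
    for name, tags in leaves_b.items():
        if name not in common:
            rest_b.update(tags)
    return cl + (1 - cl) * tag_sim(rest_a, rest_b, top_k, j)


def jaccard(leaves_a, leaves_b):
    """Union-normalized overlap, shown only for comparison."""
    a, b = set(leaves_a), set(leaves_b)
    return len(a & b) / len(a | b) if a | b else 0.0


names = ["bird", "cat", "dog", "insect", "fish",
         "horse", "snake", "frog", "bee", "deer"]
big = {name: Counter({name: 1}) for name in names}
small = {"bird": Counter({"bird": 1}), "cat": Counter({"cat": 1})}

a = {"cat": Counter({"cat": 2}), "dog": Counter({"puppy": 3})}
b = {"cat": Counter({"kitten": 1}), "hound": Counter({"puppy": 2})}
partial = struct_sim_rr(a, b, j=2)
```

Here the small sapling is fully contained in the large one, so the min-normalized score is 1.0 while the Jaccard overlap is only 0.2. In the second pair, one of two leaves matches by name and the unmatched leaves share one tag, giving 0.5 + 0.5 × 0.5 = 0.75.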

When we decide that roots $r^A$ and $r^B$ are similar, we
merge saplings A and B with the mergeByRoot(A, B) operation.
This operation creates a new sapling, M, which
combines the structures and tag statistics of A and B. In
particular, the tag statistics of the root of M are a combination
of those from $r^A$ and $r^B$. The leaves of M, $l^M$, are the union
of $l^A$ and $l^B$. If there are leaves from A and B that share a
name, their tag statistics will be combined and attached to
the corresponding leaf in M.

The width of the newly merged sapling will increase as 
more saplings are merged. Also, since we simply merge leaf 
nodes with similar names, and their roots also have similar 
names, leaf-to-leaf structural similarity structSimLL(,) is
not required. This operation addresses the sparseness challenge
mentioned in Section 3.1.
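
A minimal sketch of the mergeByRoot operation follows. The `Sapling` class is a hypothetical data structure introduced only for this illustration; tag statistics are combined by adding frequencies, which is one plausible reading of "combination".

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Sapling:
    root: str
    root_tags: Counter = field(default_factory=Counter)
    leaves: dict = field(default_factory=dict)  # leaf name -> Counter of tags


def merge_by_root(a, b):
    """Create a new sapling M from A and B; the inputs are left unchanged."""
    merged = Sapling(a.root, a.root_tags + b.root_tags)
    for sapling in (a, b):
        for name, tags in sapling.leaves.items():
            # Leaves that share a name have their tag statistics combined.
            merged.leaves[name] = merged.leaves.get(name, Counter()) + tags
    return merged


a = Sapling("animal", Counter(dog=2), {"dog": Counter(dog=2)})
b = Sapling("animal", Counter(dog=1, cat=3),
            {"dog": Counter(dog=1), "cat": Counter(cat=3)})
m = merge_by_root(a, b)
```

The merged sapling has the union of the leaves (dog and cat), and the shared dog leaf carries the summed tag counts from both inputs.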

Root-to-Leaf similarity. Merging the root node of one sapling
with the leaf node of another sapling extends the depth
of the learned folksonomy. Since we consider a pair of nodes
with different roles, their neighboring nodes also have
different roles. This would appear to make them structurally
incompatible. However, in many cases, some overlap between
the siblings of one sapling and the children of another sapling
exists. Formally, suppose that we are considering similarity
between leaf $l^A_i$ of sapling A and root $r^B$ of sapling B.
There might be some $l^A_{k \neq i}$ of A similar to $l^B_j$ of B.
Consider Figure 2(c). Suppose that we have already merged the
uk saplings. Now there are two saplings, uk → {scotland,
glasgow, edinburgh, london} and scotland → {glasgow,
shetland}, and we would like to merge the two scotlands.
Since both the uk and scotland saplings have glasgow in
common, and the user placed glasgow under uk instead of
scotland, this shortcut contributes to the similarity between
the scotland nodes. The structural similarity between leaf and
root nodes that takes this type of shortcut into consideration is

$$\mathrm{structSimLR}(l^A_i, r^B) = \mathrm{structSimRR}(r^A, r^B). \tag{4}$$

Specifically, this is simply the root-to-root structural
similarity of $r^A$ and $r^B$, which measures the overlap between
the siblings of $l^A_i$ and the children of $r^B$. When there is no
shortcut, the contribution from this part drops out; hence,
Eq. 1 will be based only on the local similarity.
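
The contribution of a shortcut to the leaf-to-root similarity can be illustrated with the name-overlap part of the measure alone. This sketch computes only the common leaves factor with the min(,) normalization and ignores the tag term; the function name is hypothetical, and the "orkney" leaf is invented to show the case without a shortcut.

```python
def common_leaves_factor(leaves_a, leaves_b):
    """CL with Z = min(|l^A|, |l^B|), assuming unique leaf names per sapling."""
    a, b = set(leaves_a), set(leaves_b)
    z = min(len(a), len(b))
    return len(a & b) / z if z else 0.0


uk = ["scotland", "glasgow", "edinburgh", "london"]
scotland = ["glasgow", "shetland"]

with_shortcut = common_leaves_factor(uk, scotland)               # glasgow is shared
without_shortcut = common_leaves_factor(uk, ["shetland", "orkney"])
```

With the shared glasgow leaf the factor is 0.5; without any shared leaf it is 0.0, in which case the decision rests on the local similarity alone.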

4.3 SAP: Growing a Tree by Merging Saplings 

We now describe the SAP algorithm, which uses the operations
defined above to incrementally grow a deeper, bushier tree by
merging saplings created by different users. In order to learn
a folksonomy corresponding to some concept, we start by
providing a seed term, the name of that concept. The seed
term will be the root of the learned tree. We cluster individual
saplings whose roots have the same name as the seed,
using the similarity measures of Eq. 1, Eq. 2, and Eq. 3 to
identify similar saplings. Saplings within the same cluster
are merged into a bigger sapling using the mergeByRoot(,)
operation. Each merged sapling corresponds to a different
sense of the seed term.

Next, we select one of the merged saplings as the starting 
point for growing the folksonomy for that concept. For each 
leaf of the initial sapling, we use the leaf name to retrieve all 
other saplings whose roots are similar to the name. We then 
merge saplings corresponding to different senses of this term 
as described above. The merged sapling whose root is most
similar to the leaf (using the similarity measures of Eq. 1,
Eq. 2, and Eq. 4) is then linked to the leaf. In the case that
several saplings match the leaf, we merge all of them together
before linking. Clustering saplings into different senses and then
merging relevant saplings to the leaves of the tree proceeds
incrementally until some threshold is reached.

Suppose we start with the saplings shown in Figure 2(c), and
the seed term is uk. The process will first cluster uk saplings. 
Suppose, for illustrative purposes, that there is only one 
sense of uk, resulting in a single sapling with root uk. Next, 
the procedure selects one of the unlinked leaves, say glas- 
gow, to work on. All saplings with root glasgow will be clus- 
tered, and the merged glasgow sapling that is sufficiently 
similar to the glasgow leaf of the uk sapling will then be 
linked to it at the leaf, and so on. 
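
The incremental growth described above can be sketched as a depth-bounded expansion. This is a deliberately simplified illustration: it assumes every name has already been reduced to a single merged sapling (the hypothetical mapping `merged_saplings`), and it omits sense selection, similarity thresholds, and the handling of shortcuts and loops.

```python
def grow_tree(seed, merged_saplings, max_depth=3):
    """Grow a nested dict tree from the seed by linking saplings to leaves.

    merged_saplings maps a root name to the list of its leaf names.
    """
    def expand(name, depth):
        if depth >= max_depth:
            return {}
        return {leaf: expand(leaf, depth + 1)
                for leaf in merged_saplings.get(name, [])}

    return {seed: expand(seed, 0)}


merged = {
    "uk": ["scotland", "london"],
    "scotland": ["glasgow", "edinburgh"],
}
tree = grow_tree("uk", merged)
shallow = grow_tree("uk", merged, max_depth=1)
```

Here the depth bound plays the role of the stopping threshold; leaves with no matching sapling, such as london, simply remain leaves.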

Handling Shortcuts. Attaching a sapling A to the learned 
tree F can result in structural inconsistencies in F. One type 
of inconsistency is a shortcut, which arises when a leaf of A 
is similar to a leaf of F. In the illustration above, attaching 
the Scotland sapling to the uk tree will generate a shortcut. 




Figure 3: Appearance of mutual shortcuts between
London and England when merging the London and
England saplings. To resolve them, we compare
the similarity between the UK-London and UK-England
sapling pairs. Since the England sapling is closer to
UK than the London sapling, we simply attach the England
sapling to the tree while ignoring the London leaf under
UK.
shorter path and keep the longer one which captures more
specific knowledge. 

There are cases where the decision to drop the shorter
path cannot be made immediately. Suppose we have uk →
{london, england, scotland} as the current learned tree,
and are about to attach london → {british museum, dockland,
england} to it. Unfortunately, some users placed england
under london, and attaching this sapling will create a
shortcut to england. The decision to eliminate the shorter
path to england cannot be made at this point, since we have
no information about whether attaching the england sapling
will also create a shortcut to london from the root (uk). We
have to postpone this decision until we retrieve all relevant
saplings that can be attached to the present leaf ($l^{uk}_{london}$)
and its siblings ($l^{uk}_{england}$ and $l^{uk}_{scotland}$).

Suppose that $l^{uk}_{england}$ does match the root of the sapling
england → {london, manchester, liverpool}. Mutual shortcuts
to england and london would undesirably appear once
all the saplings are attached to the tree. Hence, the decision
to drop $l^{uk}_{england}$ or $l^{uk}_{london}$ must be made. We base the
decision on similarity. Intuitively, a sapling that is more similar,
or "closer," to $r^{uk}$ should be linked to the tree. Formally, the
node to be kept is $l^{uk}_{\hat{x}}$, where $\hat{x} = \arg\max_x \mathrm{nodesim}(r^{uk}, r^x)$
and $x \in \{england, london\}$, while the other will be dropped.
This is illustrated in Figure 3.

Handling Loops. Attaching a sapling to a leaf of the learned
tree may result in another undesirable structure, a loop.
Suppose that we are about to attach a sapling A to the
leaf $l^F_i$ of F. A loop will appear if there exists a leaf $l^A_j$ of
A with the same name as some node in the path from the root
to $l^F_i$ in F. In order to make the learned tree consistent, we
must remove $l^A_j$ before attaching the sapling. For instance,
suppose we decide to attach the london sapling to the england
sapling in Figure 3 at its london node; we have to remove
the england node of the london sapling first.
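
The loop check can be sketched as a filter over the leaves of the sapling about to be attached. The function name is hypothetical, and the path is assumed to be available as the list of node names from the root of the learned tree down to the attachment leaf.

```python
def remove_loop_leaves(path_to_leaf, sapling_leaves):
    """Drop leaves of sapling A whose names already occur on the path in F."""
    on_path = set(path_to_leaf)
    return [leaf for leaf in sapling_leaves if leaf not in on_path]


# Attaching the london sapling at the london node under england.
path = ["uk", "england", "london"]
london_leaves = ["british museum", "dockland", "england"]
safe_leaves = remove_loop_leaves(path, london_leaves)
```

The england leaf is removed because england already lies on the path, so attaching the remaining leaves cannot create a loop.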

In some cases, loops indicate synonymous concepts. In
our data set, we found that there are users who specify
the relation animal → fauna, and those who specify the
inverse, fauna → animal. Since animal and fauna have similar
meanings, we hypothesize that this conflict appears because
of variations in users' expertise and categorization
preferences.

To determine whether a loop is caused by a synonym,
we check the similarity between $r^A$ and $r^F$. If it is high
enough, we simply remove from F the leaf $l^F_i$ for which
$name(l^F_i) = name(l^A_j)$, and then merge $r^A$ and $r^F$. The
similarity measure is based on Eq. 1. More stringent criteria
are required since $r^A$ and $r^F$ have different names.
Specifically, we modify tagSim(X, Y) to $\mathrm{tagSim}^{syn}(X, Y)$,
which instead evaluates $\frac{|\tau_X \cap \tau_Y|}{\min(|\tau_X|, |\tau_Y|)}$,
and modify structSim(X, Y) to $\mathrm{structSim}^{syn}(X, Y)$, which only
evaluates $\frac{1}{Z} \sum_{i,j} \delta(name(l^X_i), name(l^Y_j))$.

Mitigating Noisy Vocabularies. As mentioned in Section 3.2,
noisy nodes arise from idiosyncratic vocabularies used by
a small number of users. For a given merged sapling, we
can identify these nodes by the number of users who specified
them. Specifically, we use 1% of the number of all users
who "contribute" to this merged sapling as the threshold.
We then remove leaves of the sapling that are specified by
fewer users than the threshold.
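
The pruning rule can be sketched as follows. The data layout (a mapping from leaf name to the set of users who specified it) and the function name are assumptions made for this illustration; integer arithmetic is used to avoid rounding issues at the threshold.

```python
def prune_rare_leaves(leaf_users, contributors, percent=1):
    """Keep leaves specified by at least `percent`% of the contributing users.

    leaf_users: leaf name -> set of user ids who specified that leaf.
    contributors: set of all user ids contributing to the merged sapling.
    """
    return {
        name: users
        for name, users in leaf_users.items()
        if len(users) * 100 >= percent * len(contributors)
    }


contributors = set(range(300))  # threshold is 1% of 300 users = 3 users
leaf_users = {
    "bird": set(range(120)),
    "insect": {1, 2, 3},
    "mykid": {1, 2},
}
kept = prune_rare_leaves(leaf_users, contributors)
```

With 300 contributors the threshold is three users, so the idiosyncratic mykid leaf, specified by only two users, is removed.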

Managing Complexity. Computing the similarity measure 
for all pairs of saplings in the corpus is impractical, even 
considering local or structural similarity only. We address 
this scalability issue in two ways. First, we only compare
sapling nodes if they share the same (stemmed) name. This
reduces the total number of pairs that need to be compared
and eliminates the need to compute nameSim(,) in
Eq. 2. Second, we apply the blocking approach [11] for
efficiently computing similarity and merging sapling roots. The
basic idea behind this approach is to first use a cheap sim- 
ilarity measure to "roughly" group similar items. We can 
then thoroughly compute item similarities and merge them 
within each "roughly similar" group by using the more com- 
putationally expensive similarity measure. We assume that 
items judged to be dissimilar by the cheap measure will also 
be dissimilar when evaluated by the more expensive mea- 
sure. Since the approach applies the expensive measure to 
a much smaller set of items, it reduces the time complexity 
of the clustering method. 

In our case, we compute an inexpensive similarity measure 
based on the most frequent tags. Specifically, we map the 
top tags to some integer code, which can be cheaply sorted 
by any database. Subsequently, we use the database to sort 
saplings by their codes, moving roughly similar saplings to 
neighboring rows. The process begins by scanning sorted 
saplings in the database table on a sapling by sapling basis. 
If the presently scanned sapling has not been merged with 
some other sapling, we add this sapling to the top of the 
queue. If the present sapling does belong to some merged 
sapling, we check if this sapling is also similar to some other 
merged saplings in the queue. We use Eq. 1, Eq. 2, and Eq. 3
to evaluate their similarity. If they are similar enough, we 
will merge them together into a new merged sapling; then 
add it to the top of the queue. The scanning is performed 
repeatedly until the number of merged saplings no longer 
changes. 
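
The blocking idea can be sketched as follows. This is a simplified single-pass variant, not the original procedure: the cheap key is a sorted tuple of the most frequent tags rather than an integer code stored in a database, the queue is bounded by a hypothetical `window` parameter, and the expensive similarity test is passed in as a function.

```python
from collections import Counter, deque


def cheap_key(tags, top=3):
    """Cheap sort key: the most frequent tags in a canonical order."""
    return tuple(sorted(tag for tag, _ in tags.most_common(top)))


def block_and_merge(tag_stats, similar, window=5):
    """Sort by the cheap key, then compare each item only with recent clusters."""
    queue = deque(maxlen=window)  # most recent merged clusters first
    clusters = []
    for tags in sorted(tag_stats, key=cheap_key):
        for cluster in queue:
            if similar(cluster, tags):   # expensive measure, applied rarely
                cluster.update(tags)     # merge into the existing cluster
                break
        else:
            cluster = Counter(tags)
            clusters.append(cluster)
            queue.appendleft(cluster)
    return clusters


def share_two_tags(a, b):
    """Stand-in for the expensive similarity test."""
    return len(set(a) & set(b)) >= 2


turkeys = [
    Counter(bird=3, beak=2, farm=1),
    Counter(istanbul=4, mosque=2, bazaar=1),
    Counter(bird=2, beak=2, feed=1),
]
senses = block_and_merge(turkeys, share_two_tags)
```

Sorting places the two bird saplings next to each other, so they are merged after a single expensive comparison, while the country sapling forms its own cluster. The procedure in the text additionally repeats the scan until the number of merged saplings stops changing.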

4.4 Complexity Analysis 

Here we sketch the computational complexity of SAP. 
Basically, SAP can be decomposed into 2 different parts: (1) 
root-to-root merging, which expands folksonomies' width; 
(2) leaf-to-root merging, which extends folksonomies' depth. 
These two parts are loosely dependent, i.e., one can cluster 
all saplings into different senses, then "vertically" merge the 
root of one sapling sense with a leaf of the other. Since 
we use blocking and only cluster saplings with the same 
stemmed names, the computational complexity depends on 
(1) the number of unique stemmed names in the data set; 
(2) the average number of saplings that share a name. Let 
$N$ and $M$ be the number of nodes and the number of unique 
stemmed names in the data set, respectively. Hence, for each 
stem, there are $N/M$ nodes to be compared on average. We 
use the database to first roughly sort saplings, which generally 
requires $O(\frac{N}{M}\log\frac{N}{M})$. After saplings are 
sorted, they are scanned and merged. This is repeated, say for 
$i$ iterations, until the number of clusters no longer changes, 
which requires $O(i \times \frac{N}{M})$. In all, the complexity 
of the first part is $O(N\log\frac{N}{M} + iN)$. Empirically, 
the number of clusters converges in 2-3 iterations on average. 
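
As a quick check of the arithmetic, the per-stem costs can be
summed over all stems. The sketch below only restates the bound
under the assumptions already made in the text: N nodes in
total, M unique stemmed names, about N/M nodes per stem, and i
scan iterations. It is not an additional result.

```latex
\begin{align*}
\underbrace{M}_{\text{stems}} \times
\Big( \underbrace{\tfrac{N}{M}\log\tfrac{N}{M}}_{\text{sort}}
    + \underbrace{i \cdot \tfrac{N}{M}}_{\text{scan and merge}} \Big)
  &= N\log\tfrac{N}{M} + iN .
\end{align*}
```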

Let b and d be the branching factor and the depth of the 
tree we want to produce. In addition, suppose that there are 
s sapling senses for each stemmed name on average. Since 
we have to traverse each inner node of the tree to attach 
relevant sapling senses, and for each of these nodes we need 
to compare the similarity to all sapling senses with similar 
root names, this requires $O(s \times b^d)$. 

Our earlier work, SIG [12], which is described in more 
detail in Section 5, only considered the best path from a root 
to a given leaf of the tree, and required enumerating all 
possible paths between them. In the best case, when there are 
no shortcuts or loops in the data set, the number of paths 
from the root to all leaves of a given tree is equal to the 
number of the leaves, and that only requires $O(b^d + b^{d-1})$ 
to check whether each edge should be included. In the worst 
case, when shortcuts appear to all node pairs, we would 
need 0(( ^'^) x h"^) to check all possible edges. Moreover, 
we also need to enumerate all possible paths from the root to 
all leaves of the tree, which requires 
$O(1 + \sum_{e=1}^{d-1}\binom{d-1}{e})$ 
per root-to-leaf pair. Hence, we expect our approach to scale 
better than the previous one as the depth of the output tree 
increases and when there are many shortcuts. 

5. EMPIRICAL VALIDATION 

We constructed a data set containing collections and their 
constituent sets (or collections) created by a subset of Flickr 
users who are members of seventeen public groups devoted 
to wildlife and nature photography [T5]. These users had 
many other common interests, such as travel and sports, 
arts and crafts, and people and portraiture. We extracted 
all the tags associated with images in the set, and retrieved 
all other images that the user annotated with these tags. We 
constructed personal hierarchies, or saplings, from this data, 
with each sapling rooted at one of the user's top-level 
collections. For reasons described in Section 4.1, we ignore 
collections with composite names. This reduces the size of the 
data set to 20,759 saplings created by 7,121 users. A small 
number of these saplings are multi-level. 

The folksonomy learning approach described in this paper 
has a number of parameters, as shown in Table 1. In our 
experiment, we ignored the parameter $\beta$, since only 
sapling nodes with the same name need to be compared, as 
described in the previous section. To explore the range of 
these parameters, we set up a small experiment by first 
selecting 5 different seed terms (ski, bird, victoria, africa 
and insect), then running the approach with different values. 
Optimal parameter values would enable the approach to 
reasonably combine/separate saplings with similar/different 
senses. We manually inspected the induced folksonomies to 
check how the saplings were merged/separated. 

Figure 4: Folksonomies learned for bird and sport 

The parameter K allows the approach to consider only 
top frequency tags, which tend to be more stable and less 
noisy [5]. Nevertheless, the top tags will not contain enough 
information if the number is set too low, e.g., K = 10. At 
the fixed values of the common tag threshold, J = 4, and 
the structural-local weight combination, $\alpha_{RR}$ = 0.1 
(in this case, we simply evaluated on merging root-root nodes; 
hence there is no need for $\alpha_{LR}$), we found that the 
approach performs reasonably well when the value of K is 
around 30-60, while the performance starts to degrade for 
K > 60. Smaller values of J lead to a weak tag similarity 
measure, which, in turn, mistakenly causes the approach to 
merge saplings with different senses. A large J is relatively 
stringent, and as a result, saplings of the same sense will not 
be merged. We found that, at K = 40, a value of J between 
4 and 6 yields reasonable results. 

Parameter        Description 
K                The number of top frequent tags 
J                The number of common tags for tag similarity 
$\alpha_{RR}$    The weight combination of local and structural 
                 similarity for computing root-to-root similarity 
$\alpha_{LR}$    The weight combination of local and structural 
                 similarity for computing leaf-to-root similarity 
$\beta$          The weight combination of name and tag similarity 
                 (not required in our experiment) 
$\tau$           The similarity threshold 

Table 1: Parameters of the folksonomy learning approach. 

For $\alpha_{RR}$ and $\alpha_{LR}$, the weight combinations 
between local and structural similarity for root-root and 
leaf-root nodes in Eq. 1, the larger the values, the more the 
similarity measure emphasizes structural similarity. From our 
experiments, we found that the structure information is very 
informative. When $\alpha_{RR}$ is set to a very large value or 
the maximum, 1.0, the approach clusters "structure-rich" 
saplings, i.e., saplings containing many children, reasonably 
well. For leaf-to-root merging or in situations where 
structural information is uncommon, local similarity becomes 
more important. We discovered that at $\alpha_{RR}$ = 0.1 and 
$\alpha_{LR}$ = 0.8, the approach produces reasonable 
folksonomies. Due to space limitations, we do not include the 
complete set of results. Here, we report the parameter values 
that resulted in good performance: we set K = 40; J = 4. In 
addition, since all similarity measures are normalized to range 
within 0.0 and 1.0, we set $\tau$ = 0.5. 

We compare SAP against the folksonomy learning method, 
SIG, described in [12]. Briefly, SIG first breaks a given 
sapling into individual (collection-set) parent-child 
relations. Under the assumption that nodes with the same 
(stemmed) name refer to the same concept, the approach 
employs hypothesis testing to identify the informative 
relations, i.e., it checks that a relation is not generated at 
random. Informative relations are then linked into a deeper 
folksonomy. We used a significance test threshold of 0.01. 

5.1 Methodology 

We quantitatively evaluate the induced folksonomies by 
(1) automatically comparing them to a reference hierarchy; 
(2) structural evaluation; (3) manual evaluation. 

Evaluation against the reference hierarchy: We use 
the reference hierarchy from the Open Directory Project 
(ODP). We selected ODP because, in contrast to WordNet, 
ODP is generated, reviewed and revised by many registered 
users. These users seem to use more colloquial terms 
than appear in WordNet. In addition, like Flickr users, they 
specify less formal relations, mainly broader/narrower rela- 
tions. WordNet, on the other hand, specifies a number of 
formal relations among concepts, including hypernymy and 
meronymy. 

We use the methodology described in [12] to automatically 
evaluate the quality of the learned folksonomies. Although 
ODP and saplings are generated from different sources, there 
is substantial vocabulary overlap that makes them compa- 
rable. Since the ODP hierarchy is relatively large and com- 
posed of many topics, we had to carve out the "relevant" 
portion for comparison. First, we specified a seed, S, which 
is the root of the learned folksonomy F and the reference 
hierarchy to which it is compared. 

Next, the folksonomy is expanded two levels along the 
relations in F. The nodes in the second level are added as 
leaf candidates, LC. If the spanning stops after one level, 
we also add this node's name to LC. Given S and LC, 
we identify leaf candidates, $LC_D$, that also appear in ODP, 
D. All paths from S to $LC_D$ in D constitute the reference 
hierarchy for the seed S. 
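
The carving step can be illustrated with a small sketch. It
assumes both the learned folksonomy F and the ODP hierarchy D
are given as parent-to-children dictionaries; the function
names and the toy data in the usage note are hypothetical and
only meant to show the shape of the computation.

```python
def leaf_candidates(children, seed):
    """Expand two levels from the seed; a first-level node with no
    children is itself added as a candidate."""
    candidates = set()
    for child in children.get(seed, []):
        grandchildren = children.get(child, [])
        if grandchildren:
            candidates.update(grandchildren)
        else:
            candidates.add(child)
    return candidates


def paths_to(children, node, targets, prefix=()):
    """Yield every downward path from node that ends at a target."""
    path = prefix + (node,)
    if node in targets and len(path) > 1:
        yield path
    for child in children.get(node, []):
        if child not in path:  # guard against loops in the data
            yield from paths_to(children, child, targets, path)


def reference_paths(folksonomy, odp, seed):
    """All ODP paths from the seed to candidates that also occur in ODP."""
    odp_nodes = set(odp) | {c for kids in odp.values() for c in kids}
    lc_d = leaf_candidates(folksonomy, seed) & odp_nodes
    return sorted(paths_to(odp, seed, lc_d))
```

For instance, with a toy folksonomy `{"bird": ["owl", "duck"],
"owl": ["barn owl"]}` and a toy ODP fragment in which `owl` sits
under `raptor`, the candidates are `barn owl` and `duck`, and the
reference hierarchy consists of the ODP paths reaching them.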

Next, S is used as the seed for learning the folksonomy 
associated with this concept. In SIG, S and LC are both used 
to learn the folksonomy. The maximum depth of learned trees 
is limited to 4. The metrics used to compare the learned 
folksonomies to the reference are Lexical Recall (LR) [8] and 
the modified Taxonomic Overlap, mTO, defined in [12]. Lexical 
Recall measures the overlap between the learned and reference 
taxonomies, independent of their structure. mTO measures 
the quality of structural alignment of the taxonomies. Here, 
we report the harmonic mean, fmTO, instead, because of 
mTO's asymmetry. Since the proposed approach generates 
bushy folksonomies whose leaf nodes may not appear in the 
reference taxonomy, the mTO metric may unfairly penalize 
the learned folksonomy. Instead, we only consider the paths 
of the learned folksonomy that are comparable to the 
reference hierarchy. Specifically, for each leaf l in $LC_D$, 
we select the path S → l in the learned folksonomy and compare 
it to one in the reference hierarchy. If there are many 
comparable paths in the reference, we select the one that has 
the highest LR to compare. 
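
To make the two scores concrete, here is a small sketch. It
assumes Lexical Recall is the fraction of reference concepts
that also occur in the learned folksonomy, and that fmTO is the
harmonic mean of mTO computed in the two directions; the exact
definitions are those of [8] and [12], and the function names
below are hypothetical.

```python
def lexical_recall(learned_terms, reference_terms):
    """Share of reference concepts that also appear in the learned tree."""
    reference = set(reference_terms)
    if not reference:
        return 0.0
    return len(set(learned_terms) & reference) / len(reference)


def harmonic_mean(a, b):
    """Symmetric combination of two directional scores."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)
```

The harmonic mean is dominated by the smaller of the two
directional values, so a folksonomy cannot score well by
matching the reference in one direction only.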

Structural evaluation: Ideally, we prefer an approach 
that generates bushier and deeper trees. In such trees, the 
scope of concepts is broadly enumerated (tree width), while 
each concept is subcategorized in enough detail (tree 
depth). Although one can use the average depth of a tree 
and its branching factor, it is difficult to justify which 
trees are better overall, since these metrics are independent. 
A very bushy tree may have a depth of only 1 level; meanwhile, 
a very deep tree may have a chain-like structure. In this work, 
we define a simple, yet intuitive measure, Area Under Tree 
(AUT), which takes both tree bushiness and depth into 
account. To calculate AUT for a certain tree, we compute the 
distribution of the number of nodes in each level and then 
compute the area under the distribution. Intuitively, trees 
that keep branching out at each level will have larger AUT 
than those that are short and thin. Suppose that we have a 
tree with one node at the root, three nodes at the 1st level 
and four at the 2nd. With the scale of tree depth set to 1.0, 
the AUT of this tree would be 
0.5 × (1 + 3) + 0.5 × (3 + 4) = 5.5 (a sum of trapezoids). 
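
The trapezoid sum can be written in a few lines. This is a
minimal sketch of the measure as described in the text; the
function name and the `depth_scale` parameter are naming
choices made for this example.

```python
def area_under_tree(nodes_per_level, depth_scale=1.0):
    """Trapezoidal area under the nodes-per-level distribution.

    nodes_per_level[0] is the root level (normally 1 node).
    """
    pairs = zip(nodes_per_level, nodes_per_level[1:])
    return sum(0.5 * depth_scale * (a + b) for a, b in pairs)


# Worked example from the text: 1 root, 3 nodes at level 1, 4 at level 2.
print(area_under_tree([1, 3, 4]))  # 5.5
```

A chain of three levels, `[1, 1, 1]`, yields 2.0, and a
one-level star with six children, `[1, 6]`, yields 3.5, so
neither depth alone nor width alone is enough for a large value.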

Manual evaluation: We use 3 human subjects to evaluate 
the portions of the induced folksonomies that were not 
comparable to the ODP hierarchy. We randomly selected 10% of 
the paths (all of them if there are fewer than 10 paths in the 
learned folksonomy) that are not in the reference hierarchy 
and asked the three judges to evaluate them. If a portion of 
the path is incorrect, either because an incorrect concept 
appears or the ordering of concepts is wrong, the judges were 
asked to mark it incorrect; otherwise, it is correct. They 
can also mark the path "unsure" if there is not enough 
evidence for a decision. A path's label is based on the 
majority decision. If there is no agreement, or the path is 
marked "unsure" by all judges, we exclude it. 
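
The labeling rule can be sketched as below. One detail is an
assumption of this sketch rather than something the text states:
a path whose majority label is "unsure" (not only a unanimous
one) is also excluded. The function names are hypothetical.

```python
from collections import Counter


def path_label(votes):
    """Majority label of a path, or None if the path is excluded.

    votes: one of 'correct', 'incorrect', 'unsure' per judge.
    """
    label, count = Counter(votes).most_common(1)[0]
    if count <= len(votes) / 2 or label == "unsure":
        return None  # no agreement, or judges could not decide
    return label


def accuracy(all_votes):
    """Share of correct paths among the paths that were not excluded."""
    labels = [path_label(v) for v in all_votes]
    kept = [label for label in labels if label is not None]
    return kept.count("correct") / len(kept) if kept else None
```

Returning `None` when every path is excluded corresponds to the
"n/a" case, where no accuracy can be computed for a seed.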

5.2 Results 

In Table 2, we compare the quality of the folksonomy 
learned for each seed by SAP and by the earlier work, SIG. SAP 
generally recovers a larger number of concepts, relative to 
ODP, as indicated by the numbers of overlapping leaves (in 
90% of the cases) and better LR scores (in 76% of the cases). 
Moreover, SAP can produce trees with higher quality, relative 
to the ODP, as indicated by the fmTO score (in 68% of 
the cases). From the structural evaluation, SAP produced 
bushier trees, as indicated by AUT, in 87% of the cases. In 
addition, the average depth (not shown in the table) from 
roots to all leaves of the trees over all cases generated by 
SAP is greater than that of SIG (2.68 vs. 2.37). 

Approach   Incorrect Path 
SAP        anim/other anim/mara 
SAP        world/landscap/architectur/scarborough 
SAP        world/scotland/through viewfind 
SAP        europ/franc/flight to 
SIG        anim/pet/chester/chester zoo 
SIG        bird/turkei/antalya 
SIG        bird/turkei/ephesu 
SIG        fauna/underwat/destin 
SIG        south africa/safari/isla paulino 
SIG        south africa/safari/la fiore 
SIG        sport/golf/ad amst 
SIG        sport/ski/cloud/other/new year 
SIG        world/canada/victoria/melbourn 

Table 3: The table lists all incorrect paths caused 
by possibly ambiguous nodes, which are in bold. 

Although the manual evaluation suggests that both 
approaches achieve about the same quality on the paths 
that are not comparable to ODP, after closely inspecting the 
learned trees, we found that SAP demonstrates its advantage 
over SIG in disambiguating and correctly attaching relevant 
saplings to appropriate induced trees. For instance, the bird 
tree produced by SAP does not include Istanbul or other 
Turkey locations, as shown in Figure 4. In the sport tree, 
SAP does not include any concept about the sky (note that 
skies and skiing share a common stemmed name). In addition, 
there are no concepts about irrelevant events like birthdays 
and parades appearing in the tree. There are some cases, e.g., 
dog and cat, where we could not compute the hand labeling 
scores because these trees often contained pet names, rather 
than breeds. 

We further considered how many of the incorrect paths 
are caused by node ambiguity. To do so, we first identified 
ambiguous terms, and checked how many of the incorrect 
paths contain these terms. Although it is not obvious how to 
automatically identify ambiguous terms, we use the following 
heuristic to determine the possible ambiguities: for a given 
leaf of the induced tree, if many different merged senses 
exist (i.e., > 10), then we consider the leaf ambiguous. 
During the tree induction process, we keep track of these 
nodes and the root. Subsequently, we use the ambiguous terms 
and their root names to check the accuracy of paths in the 
hand labeled data containing them. As presented in Table 3, 
SAP reduces the error on ambiguous paths by about half. 
This supports our claim about the superiority of SAP on node 
disambiguation. 
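
The heuristic amounts to a threshold on the number of merged
senses per leaf name. The sketch below is a simplification: it
ignores the pairing of each ambiguous term with its root name,
and the function names and the sense counts used in any call
are hypothetical.

```python
def ambiguous_leaves(sense_counts, min_senses=10):
    """Leaf names with more than min_senses merged senses."""
    return {name for name, n in sense_counts.items() if n > min_senses}


def paths_through(paths, ambiguous):
    """Slash-separated paths that contain at least one ambiguous node."""
    return [p for p in paths if ambiguous & set(p.split("/"))]
```

Applying `paths_through` to the hand-labeled incorrect paths of
each approach gives the kind of per-approach count that Table 3
enumerates.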

In all, the proposed approach, SAP, has several advantages 
over the baseline, SIG. First, it exploits both structure 
information and tag statistics to combine relevant saplings, 
which can produce more comprehensive folksonomies as well 
as resolve ambiguity of the concept names. Second, it 
allows similar concepts to appear multiple times within the 
same hierarchy. For example, SAP allows the anim folksonomy 
to have both anim → pet → cat and anim → mammal → cat paths, 
while only one of these paths is retained by SIG. Last, SAP 
can identify synonyms from structure (loops). We learned the 
following synonyms from Flickr data: {anim, creatur, critter, 
all anim, wildlife} and {insect, bug}. 

                 Whole folksonomies            Comparison with ODP                                      Manual
                 #leaves      AUT              #Ovlp lvs  fmTO           LR             AUT             Acc (10%)
seeds            SIG   SAP    SIG     SAP      SIG  SAP   SIG    SAP     SIG    SAP     SIG    SAP      SIG   SAP
anim             268   583    694.0   1076.0   68   92    0.602  0.659   0.281  0.360   160.0  189.5    0.89  0.74
bird             73    103    84.5    113.5    20   22    0.760  0.755   0.281  0.315   21.5   28.5     0.60  1.00
invertebr        11    15     15.5    19.5     3    1     0.762  1.000   0.250  0.125   4.5    1.5      1.00  1.00
vertebr          80    114    162.5   236.5    1          1.000  n/a     0.600  0.200   2.5    n/a      1.00  1.00
insect           29    44     35.5    61.5     5    5     0.924  0.924   0.857  0.857   6.5    6.5      1.00  1.00
fish             7     6      7.5     6.5                 n/a    n/a     0.016  0.016   n/a    n/a      1.00  1.00
plant            110   194    265.5   426.0    6    7     0.613  0.735   0.250  0.273   13.0   11.5     0.67  1.00
flora            64    403    173.0   1048.5   6    18    0.483  0.481   0.130  0.407   16.0   84.0     1.00  1.00
fauna            141   609    420.0   1146.0   9    31    0.463  0.490   0.113  0.212   27.0   71.5     0.91  0.85
flower           112   169    210.5   226.5    1    1     0.379  1.000   0.267  0.250   3.5    1.5      1.00  n/a
reptil           3     4      4.5     4.5      2    3     0.625  0.622   0.500  0.667   2.5    3.5      n/a   n/a
amphibian        1     1      1.5     1.5      1    1     1.000  1.000   1.000  1.000   1.5    1.5      n/a   n/a
build            7     23     11.5    37.5                n/a    n/a     1.000  1.000   n/a    n/a      1.00  1.00
urban            6     80     15.0    145.5               n/a    n/a     0.071  0.071   n/a    n/a      1.00  1.00
countri          378   1605   798.5   4504.0   2    4     0.447  0.665   0.143  0.214   8.0    8.5      1.00  1.00
africa           53    71     90.5    119.5    23   27    0.773  0.895   0.508  0.547   37.5   40.5     1.00  1.00
asia             187   284    389.0   631.5    80   85    0.734  0.788   0.396  0.484   165.5  168.5    1.00  1.00
europ            379   1073   916.0   2706.5   165  301   0.619  0.670   0.236  0.418   369.0  874.5    1.00  0.94
south africa     12    17     15.5    18.5     3    3     0.431  0.600   0.444  0.444   3.5    3.5      0.78  1.00
north america    166   731    435.0   2203.5   67   118   0.545  0.576   0.165  0.319   170.5  361.5    1.00  0.92
south america    32    50     54.5    101.5    12   15    0.706  0.832   0.415  0.463   20.5   28.5     1.00  1.00
central america  27    8      53.5    12.5     1    2     0.631  0.754   0.417  0.500   2.5    4.5      1.00  1.00
unit kingdom     106   267    274.5   658.5    31   82    0.787  0.724   0.099  0.127   71.5   179.5    1.00  1.00
unit state       102   375    217.0   936.5    35   55    0.620  0.749   0.130  0.256   74.5   122.0    1.00  1.00
world            545   3177   1437.0  9235.0   191  475   0.476  0.461   0.085  0.215   490.0  1676.5   0.97  0.96
citi             123   448    234.0   927.5               n/a    n/a     0.111  0.100   n/a    2.5      1.00  1.00
craft            5     1      10.5    1.5      1          0.603  n/a     0.056  0.050   2.5    n/a      1.00  n/a
dog              15    26     17.5    28.5          1     n/a    1.000   0.045  0.080   n/a    1.5      n/a   n/a
cat              11    39     13.5    41.5                n/a    n/a     0.100  0.100   n/a    n/a      n/a   n/a
sport            207   74     407.0   86.5     19   27    0.693  0.647   0.091  0.084   30.0   31.5     0.28  1.00
australia        47    83     71.0    147.5    12   27    0.354  0.665   0.123  0.216   14.5   36.5     0.67  1.00
canada           55    763    128.0   2502.0   11   27    0.620  0.587   0.158  0.241   21.5   75.5     1.00  1.00

Table 2: This table presents empirical validation on folksonomies induced by the proposed approach, SAP, 
compared to the baseline approach, SIG. The first column group presents properties of the whole induced 
trees: the number of leaves and Area Under Tree (AUT). The second column group reports the quality of 
induced trees, relative to the ODP hierarchy. The metrics in this group are modified Taxonomic Overlap 
(fmTO) (averaged using Harmonic Mean) and Lexical Recall (LR), whose scales range from 0.0 to 
1.0 (the more the better), and AUT, which is computed from the portions of the trees that are comparable 
to ODP. "#Ovlp lvs" stands for the number of overlapping leaves (with ODP). The last column group reports 
performance on manually labeled portions of the trees, which do not occur in ODP. In some cases, "n/a" 
appears since we cannot compute the corresponding value. 

6. RELATED WORK 

Constructing ontological relations from text has long 
interested researchers, e.g., [6, 16, 19]. Many of these 
methods exploit linguistic patterns to infer if two keywords 
are related under a certain relationship. However, these 
approaches are not applicable to social metadata because it is 
usually ungrammatical and much more inconsistent than natural 
language text. 

Several researchers have investigated various techniques 
to construct conceptual hierarchies from social metadata. 
Most of the previous work utilizes tag statistics as evidence. 
Mika [10] uses a graph-based approach to construct a 
network of related tags, projected from either a user-tag or 
an object-tag association graph, then induces broader/narrower 
relations using betweenness centrality and set theory. Other 
works apply clustering techniques to tags, and use their 
co-occurrence statistics to produce conceptual hierarchies [3]. 
Heymann and Garcia-Molina [7] use centrality in the 
similarity graph of tags. The tag with the highest centrality 
is considered more abstract than one with a lower centrality; 
thus it should be merged into the hierarchy first, to guarantee 
that more abstract nodes are closer to the root. Schmitz [15] 
applied a statistical subsumption model [14] to induce 
hierarchical relations among tags. Since these works are based 
on tag statistics, they are likely to suffer from the 
"popularity vs. generality" problem, where a tag may be used 
more frequently not because it is more general, but because it 
is more popular among users. 

Our present work, SAP, differs from our earlier 
approach, SIG [12], in many aspects. First, SAP exploits more 
evidence, i.e., structure and tag statistics of personal 
hierarchies rather than individual relations' co-occurrence 
statistics as in SIG. Second, SAP is based on the relational 
clustering approach that incrementally attaches relevant 
saplings to the learned folksonomies, whereas SIG exhaustively 
determines the best path out of all possible paths from the 
root node to a leaf, which is computationally expensive when 
the learned folksonomies are deep. Last, SAP demonstrates many 
advantages, as presented in Section 5. 

The sapling merging approach described in this paper is 
an extension of the collective relational clustering approach 
used for entity resolution [1]. That work proposed a method to 
identify and disambiguate entities, such as authors, that 
utilizes two types of evidence: intrinsic and extrinsic 
features. Intrinsic features are associated with specific 
instances, such as author names, while extrinsic features are 
derived from structural evidence, e.g., co-authors in a 
citations database. Intuitively, two names refer to the same 
author if they are similar and their co-author names refer to 
the same set of authors. Analogously, we identify and 
disambiguate concept names from names and tags (intrinsic) and 
neighboring nodes' features (extrinsic). However, for 
efficiency reasons, we use the naive version of the relational 
clustering, where we directly use the features from neighbors 
as the extrinsic features, rather than cluster labels. 

Handling mutual shortcuts by keeping the sapling which 
is more similar to the ancestor is similar in spirit to the 
minimum evolution assumption in [19]. Specifically, a 
certain hierarchy should not have any sudden changes from a 
parent to its child concepts. Our approach is also similar to 
several works on ontology alignment (e.g., [11, 17]). However, 
unlike those works, which merge a small number of deep, 
detailed and consistent concepts, we merge a large number of 
noisy and shallow concepts, which are specified by different 
users. 


7. CONCLUSION 

This paper describes an approach which incrementally 
combines a large number of shallow hierarchies specified 
by different users into common, denser and deeper folk- 
sonomies. The approach addresses the challenges of learning 
folksonomies from social metadata and demonstrates sev- 
eral advantages over the previous work. Additionally, it is 
general enough for other domains, such as tags/bundles in 
Delicious and files/folders in personal workspaces. 

For future work, in addition to automatically 
separating broader/narrower from related-to relations, we would 
like to develop a systematic way to handle individual saplings 
whose child nodes are from different facets. This will 
improve the quality of the learned folksonomies by not 
mixing concepts from different facets. We are also working on 
combining more sources of evidence, such as geographical 
information, for learning accurate folksonomies. Lastly, we 
would like to frame the approach in a fully probabilistic way 
(e.g., [18], [2]), which provides a systematic way to combine 
heterogeneous evidence, and takes into account uncertainties 
on similarities between concepts and relations. 